Skip to content

Runtime/namespace/client wide worker heartbeat#983

Merged
yuandrew merged 21 commits intotemporalio:worker-heartbeatfrom
yuandrew:process-wide-worker-heartbeat-2
Sep 24, 2025
Merged

Runtime/namespace/client wide worker heartbeat#983
yuandrew merged 21 commits intotemporalio:worker-heartbeatfrom
yuandrew:process-wide-worker-heartbeat-2

Conversation

@yuandrew
Copy link
Copy Markdown
Contributor

@yuandrew yuandrew commented Aug 18, 2025

NOTE: this targets a worker-heartbeat feature branch to merge into, that way I can make incremental progress and not hit main until the whole feature is ready for both server and SDK.

What was changed

Worker heartbeat duration is a setting configured with RuntimeOptions, but internally the heartbeat mechanism lives on the client. This way we can properly replace a client the SharedNamespaceWorker uses when replacing the client of a regular worker.

A follow up PR will address filling out the rest of the heartbeat data, most of the remaining pieces require an implementation of storing metrics in memory so we can pull that data on each heartbeat.

Why?

Worker heartbeat. I separated WorkerHeartbeat out of #962. Some of the design for this is with Worker Commands in mind, but that will come at a later time.

Checklist

  1. Closes

  2. How was this tested:

fixed up worker_heartbeat unit test

verified with prints (printed SDK name and version) that replace_client works with heartbeating, but couldn't come up with a good unit/integration test to verify this behavior

  1. Any docs updates needed?

@yuandrew yuandrew requested a review from a team as a code owner August 18, 2025 22:34
cursor[bot]

This comment was marked as outdated.

Copy link
Copy Markdown
Contributor

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR titled "process-wide" but actually it's "runtime-wide" it seems. But IMO it should be at the client level, where you have the connection to actually make the invocation and is the thing workers share to communicate to server.

Comment thread core-c-bridge/src/runtime.rs Outdated
Comment thread core-c-bridge/src/runtime.rs Outdated
Comment thread core/src/lib.rs Outdated
Comment thread core/src/lib.rs Outdated
@yuandrew yuandrew changed the title Process-wide worker heartbeat Runtime-wide worker heartbeat Aug 19, 2025
Copy link
Copy Markdown
Member

@Sushisource Sushisource left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the client/runtime thing, I think this is pretty preferential and an internal detail that won't matter much to users.

The thing I think that does matter is clients don't "own" workers (and in fact can't, due to circular deps). They are used when creating a worker, but the relationship is more like a worker "has" a client, and more than one worker may have the same client.

That's why I don't really want the heartbeating duration to be a client property. Right now our client options in the lang layer don't directly reference anything worker specific (they do indirectly with interceptors/plugins), and I think it should really stay that way. Specifying a heartbeating duration for a client that is never used with a worker, for example, is weird.

So, that means either this option stays on a Runtime (seems perfectly fine to me) or needs to be passed in when initting the worker, but then there's some last-write-wins problem if users use the same client with different values when passing to workers. So, more reason to put it on the Runtime.

As for literally where this map lives or not - I don't have a huge preference either way. Changing the trait that was used to get slot suppliers for eager to be more generic & used for this I can see. I'm also fine with the current setup.

Comment thread core/src/worker/heartbeat.rs Outdated
Comment thread core/src/worker/heartbeat.rs Outdated
Comment thread core/src/worker/mod.rs Outdated
Comment thread core/src/worker/mod.rs Outdated
Comment thread core/src/lib.rs Outdated
Comment thread core/src/lib.rs Outdated
@cretz
Copy link
Copy Markdown
Contributor

cretz commented Aug 20, 2025

The thing I think that does matter is clients don't "own" workers (and in fact can't, due to circular deps)

Right, they are associated with them. That's the point. Same for eager workflow start. A worker is associated with a single client, not some other client in the process/runtime.

@Sushisource - to confirm, you don't believe that we need to make sure a worker's heartbeat should occur on the worker's client? I think that is much simpler to understand for those that want to understand what calls use what clients. This is important because users control lifetimes of clients and handle explicit client replacement to update auth and are the ones that control which clients are for which workers.

I still can't really see a reason not use a worker's client to make calls relating to that worker (or set of workers that share the client).

Comment thread core/src/worker/mod.rs Outdated
@Sushisource
Copy link
Copy Markdown
Member

I still can't really see a reason not use a worker's client to make calls relating to that worker (or set of workers that share the client).

Sure, I think that's fair (though the actual scenario in the current design where that wouldn't happen seems to be exceptionally rare). I maintain the config option shouldn't live in the client options, though.

@cretz
Copy link
Copy Markdown
Contributor

cretz commented Aug 21, 2025

I maintain the config option shouldn't live in the client options, though.

Do you believe I as a user need to make a whole new Tokio thread pool if I want to disable worker heartbeating only for some workers? Not a rhetorical question, I don't mind if the answer is "yes, should be rare".

@Sushisource
Copy link
Copy Markdown
Member

I maintain the config option shouldn't live in the client options, though.

Do you believe I as a user need to make a whole new Tokio thread pool if I want to disable worker heartbeating only for some workers? Not a rhetorical question, I don't mind if the answer is "yes, should be rare".

Technically you wouldn't even have to, right now (at least how this is exposed in Core - you might have to depending on how lang does it). You can make a new CoreRuntime without constructing a new Tokio runtime, but rather re-using an existing one.

So, we can support that. But, yeah even if we don't I'm not hugely concerned.

cursor[bot]

This comment was marked as outdated.

cursor[bot]

This comment was marked as outdated.

@yuandrew yuandrew force-pushed the process-wide-worker-heartbeat-2 branch from 9d39b9f to e08370e Compare September 8, 2025 23:52
@yuandrew yuandrew force-pushed the process-wide-worker-heartbeat-2 branch from e08370e to d324044 Compare September 9, 2025 00:02
cursor[bot]

This comment was marked as outdated.

@yuandrew yuandrew force-pushed the process-wide-worker-heartbeat-2 branch from 198f5b8 to 0fc4e05 Compare September 9, 2025 06:25
Comment thread client/src/lib.rs
Comment thread client/src/lib.rs Outdated
Comment thread client/src/worker_registry/mod.rs Outdated
Comment thread client/src/worker_registry/mod.rs Outdated
Comment thread core/src/lib.rs
/// Optional worker heartbeat interval - This configures the heartbeat setting of all
/// workers created using this runtime.
#[builder(default = Some(Duration::from_secs(60)))]
heartbeat_interval: Option<Duration>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it's unnecessary to put this on the runtime instead of the client, but it's not a big deal. It disallows hybrid users (e.g. users that connect to self hosted and cloud) from disabling/adjusting heartbeats specific to their environment. From a lang POV they'll have to create an entirely new thread pool to be able to do so. It seems like the only benefit to putting this value on runtime instead of client is "feels" IIUC.

Having said that, not a big deal, up to y'all.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, again, it doesn't mean a new threadpool. We established in my other comment you can re-use the same tokio runtime for different runtimes.

The problem for me is I don't want some weird worker setting showing up in the client crate for people who are only using the client.

One way to address that I suppose is to put that option behind a feature flag that is off by default, but enabled when core brings in the client dep. I'd be OK with that. But, IMO it's mostly a non-issue either way so up to @yuandrew if you want to change it.

Comment thread core/src/lib.rs Outdated
@yuandrew yuandrew changed the title Runtime-wide worker heartbeat Runtime/namespace/client wide worker heartbeat Sep 10, 2025
Copy link
Copy Markdown
Contributor

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks great to me. Only concern is enabling it for everyone by default.

Comment thread core/src/lib.rs Outdated
telemetry_options: TelemetryOptions,
/// Optional worker heartbeat interval - This configures the heartbeat setting of all
/// workers created using this runtime.
#[builder(default = Some(Duration::from_secs(60)))]
Copy link
Copy Markdown
Contributor

@cretz cretz Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we shouldn't turn on worker heartbeating by default at this time. IMO we should get it into langs, make sure it works as we expect, has no averse effects, etc then consider turning it on by default. Arguably I should be allowed to turn it off/on by client as mentioned at #983 (comment), but that's not a big deal

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going into a feature branch, I'm turning it on for simplicity, but agree we should not turn on by default in main until everything is known to work as expected

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, but, the goal is to have it on by default when released.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Immediately upon release? Meaning there's no period where we try it out as opt-in in lang first? When we did this before, users saw unintentional logs and such. I figure we should at least try it in langs before turning it on for everyone in langs.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Given the fact that this isn't really user-facing in any visible way ATM, I think the issue is basically zero people are going to flip this on.

Certainly though, @yuandrew , as part of our testing, let's make sure we're not having anything like that happen, including SDK and Server logs.

Comment thread core/src/worker/heartbeat.rs
Copy link
Copy Markdown
Member

@Sushisource Sushisource left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I really like the way this is looking despite a bajillion comments. Mostly just polish stuff really. A few logic questions.

Comment thread client/src/worker_registry/mod.rs Outdated
Comment thread client/src/worker_registry/mod.rs Outdated
Comment thread client/src/worker_registry/mod.rs Outdated
Comment thread client/src/worker_registry/mod.rs Outdated
Comment thread client/src/worker_registry/mod.rs Outdated
Comment thread core/src/lib.rs Outdated
Comment thread core/src/lib.rs Outdated
telemetry_options: TelemetryOptions,
/// Optional worker heartbeat interval - This configures the heartbeat setting of all
/// workers created using this runtime.
#[builder(default = Some(Duration::from_secs(60)))]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, but, the goal is to have it on by default when released.

Comment thread core/src/lib.rs
/// Optional worker heartbeat interval - This configures the heartbeat setting of all
/// workers created using this runtime.
#[builder(default = Some(Duration::from_secs(60)))]
heartbeat_interval: Option<Duration>,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, again, it doesn't mean a new threadpool. We established in my other comment you can re-use the same tokio runtime for different runtimes.

The problem for me is I don't want some weird worker setting showing up in the client crate for people who are only using the client.

One way to address that I suppose is to put that option behind a feature flag that is off by default, but enabled when core brings in the client dep. I'd be OK with that. But, IMO it's mostly a non-issue either way so up to @yuandrew if you want to change it.

Comment thread sdk/src/lib.rs Outdated
Comment thread tests/integ_tests/metrics_tests.rs Outdated
r.expect_sdk_name_and_version()
.returning(|| ("test-core".to_string(), "0.0.0".to_string()));
r.expect_get_identity()
r.expect_identity()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Mock Setup Errors in WorkerClient

The worker_set_key() method, recently added to the WorkerClient trait, has issues in its mock setups. In mock_worker_client(), the expect_worker_set_key() is incorrectly configured, passing a function pointer where a closure is expected. Additionally, mock_manual_worker_client() is missing this expectation entirely, which will cause a runtime panic if the method is called.

Fix in Cursor Fix in Web

self.all_workers
.insert(worker.worker_instance_key(), worker);

Ok(())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Worker Registration Inconsistency Causes Heartbeat Issues

The ClientWorkerSetImpl::register method can leave an inconsistent state. slot_providers is updated before the worker is fully registered in all_workers. If a later step, like heartbeat setup, fails, try_reserve_wft_slot may attempt to use a non-existent worker, preventing re-registration for that namespace/task queue and permanently losing the worker's heartbeat callback.

Fix in Cursor Fix in Web

Copy link
Copy Markdown
Member

@Sushisource Sushisource left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking like we're good to go here to me I think. Just a few small things, need to fix these lints, and we're ready! Nice

Comment thread client/src/worker_registry/mod.rs Outdated
#[cfg(test)]
fn num_providers(&self) -> (usize, usize) {
(self.index.len(), self.providers.len())
(self.slot_providers.len(), self.slot_providers.len())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason to return the same number twice here, the few places tests use this can just get updated.

Comment thread client/src/worker_registry/mod.rs Outdated
Comment on lines +184 to +185
/// For slot managing, there can only be one worker registered per
/// namespace+queue_name+client, others will get ignored.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be changed to say this is an error now

Comment thread core/src/worker/mod.rs Outdated
trace_subscriber: Option<Arc<dyn Subscriber + Send + Sync>>,
}

struct WorkerHeartbeatManager {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Move this down to live next to the registrator

Comment thread core/src/lib.rs Outdated
telemetry_options: TelemetryOptions,
/// Optional worker heartbeat interval - This configures the heartbeat setting of all
/// workers created using this runtime.
#[builder(default = Some(Duration::from_secs(60)))]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Given the fact that this isn't really user-facing in any visible way ATM, I think the issue is basically zero people are going to flip this on.

Certainly though, @yuandrew , as part of our testing, let's make sure we're not having anything like that happen, including SDK and Server logs.

Comment thread core/src/worker/mod.rs
Copy link
Copy Markdown
Member

@Sushisource Sushisource left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@yuandrew yuandrew merged commit 54e2411 into temporalio:worker-heartbeat Sep 24, 2025
17 checks passed
yuandrew added a commit that referenced this pull request Oct 20, 2025
…nt level (#1038)

* Runtime/namespace/client wide worker heartbeat (#983)

* worker heartbeat

* Address Spencer's comments

* wip use client_identity_override as part of key, added test

* Refactor almost complete, need to plumb through telemetry to SharedNamespaceWorker

* Verified client replacement works, need to update tests and cleanup

* formating

* clean up

* forgot to remove new() now that using builder pattern

* Switch to worker_set_key

* Replace client test passes, need to write unit tests in worker_registry

* cargo test-lint

* limit nexus to 1 poller, add tests for worker_registry for heartbeat

* PR comments

* new test helper

* Return error on multi worker register for same namespace and task queue on same client

* cargo fmt

* Fix registration order, unique task queue for test worker

* Remove TEST_Q variable

* Missing quotes

* CI lint and docker test fix, rename worker_set_key to worker_grouping_key

* clippy bug

* Worker heartbeat: New in-memory metrics mechism, plumb rest of heartbeat data (#1023)

* plumb in memory metrics

* simplify worker::new(), fix some heartbeat metrics, new test file

* CounterImpl, final_heartbeat, more specific metric label dbg_panic msg, counter_with_in_mem and and_then()

* Support in-mem metrics when metrics aren't configured

* Move sys_info refresh to dedicated thread, use tuner's existing sys info

* Format, AtomicCell

* Fix unit test

* Set dynamic config for WorkerHeartbeatsEnabled and ListWorkersEnabled, remove stale metric previously added

* Should not expect heartbeat nexus worker in metrics for non-heartbeating integ test

* recv_timeout instead of thread::sleep, use WorkflowService::list_workers directly, WithLabel API improvement

* MetricAttributes::NoOp, add mechanism to ignore dupe workers for testing, more tests

* More tests, sticky cache miss, plugins

* Formatting, fix skip_client_worker_set_check

* Cursor found a bug

* Lower sleep time, add print for debugging

* more prints

* use semaphores for worker_heartbeat_failure_metrics

* skip_client_worker_set_check for all integ workers

* Can't use tokio semaphore in workflow code

* use signal to test workflow_slots.last_interval_failure_tasks

* Use Notify instead of semaphores, fix test flake

* Use eventually() instead of a manual sleep

* max_outstanding_workflow_tasks 2

* merge

* Forgot to commit format fixes

* Fix test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants